January, 2018

Introduction

  • Day 1 - Getting started
  • Day 2 - Functions & Spark
  • Day 3 - Tidyverse
  • Day 4 - Plotly
  • Day 5 - Shiny Introduction
  • Day 6 - Reactivity
  • Day 7 - Modules
  • Day 8 - Shiny Project

Day 3 - Tidyverse

Day 3 - Agenda

  • select
  • filter
  • arrange
  • mutate
  • summarise
  • group_by
  • %>% (pipe)

select

It’s not uncommon to get data sets with hundreds or even thousands of variables. In this case, the first challenge is often narrowing in on the variables you’re actually interested in. select() allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

select(diamonds, cut, color, carat, price)
select(diamonds, x:z)
select(diamonds, -(x:z))
select(diamonds, starts_with("c"))
select(diamonds, ends_with("e"))
select(diamonds, contains("r"))

TIP: Move sorting variables to the start of the data frame and only keep the important variables. Variables can be renamed at the same time.

filter

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame.

filter(diamonds, cut=="Ideal")
filter(diamonds, cut!="Ideal")
filter(diamonds, carat>=4) # <, >, ==, !=, <=, >=
filter(diamonds, cut=="Ideal" & carat>=4 )
filter(diamonds, cut=="Ideal" | carat>=4 )
filter(diamonds, cut %in% c("Ideal","Premium"))

sqrt(2)^2 == 2
near(sqrt(2)^2, 2)

arrange

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

arrange(diamonds, cut) #A-Z
arrange(diamonds, desc(cut)) #Z-A

arrange(diamonds, price) #Small to large
arrange(diamonds, desc(cprice)) #Large to small

arrange(diamonds, cut, desc(price)) #by two or more variables

mutate

Besides selecting sets of existing columns, it’s often useful to add new columns that are functions of existing columns. That’s the job of mutate().

  • Arithmetic: +, -, *, /, ^
  • Modular: %/% (integer division), %% (remainder), x == y * (x %/% y) + (x %% y)
  • Logs: log(), log2(), logn()
  • Offsets: lead(), lag()
  • Cumulatives: cumsum(), cumprod(), cummin(), cummax() See RcppRoll package for more.
  • Logical: <, <=, >, >=, !=, ==
  • Ranking: min_rank(), row_number(), dense_rank(), percent_rank(), cume_dist(), ntile()
  • User defined: function(){} – should be a vectorised function

mutate

TIP: Arithmetic operators are useful in conjunction with aggregate functions, e.g. X/sum(X) gives the proportion, and Y-mean(Y) computes the difference from the mean.
TIP: Offsets allows you to compute running differences (e.g. x-lag(x)) or find when values change (X != lag(X)) They are most useful in conjunction with group_by(), but make sure to sort first using arrange().

mutate(
  diamonds,
  price_p_carat = price / carat,
  diff = price_p_carat - mean(price_p_carat),
  z_score = diff / sd(price_p_carat)
)

summarise

The last key verb is summarise(). It collapses a data frame to a single row. summarise() is not terribly useful unless we pair it with group_by().

  • TIP: There are many built in functions so don’t reinvent the wheel.
  • TIP: The result of a summary can be used directly in the next step to calculate other statistics.
  • WARNING: Remember when calculating statistics that the result is not always as you would expect, e.g. mean() returns the straight average not the weighted average.
  • WARNING: Always check the documentation before using built in functions to know what options there are and what the default options are. It is important to understand exactly what you are calculating.

summarise

summarise(
  diamonds,
  N = n(),
  sum = sum(price),
  ave1 = sum / N,
  SSD = sum( (price - mean(price)) ^2),
  SD = sqrt( SSD / (n() -1) )
)
## # A tibble: 1 x 5
##       N       sum   ave1          SSD      SD
##   <int>     <int>  <dbl>        <dbl>   <dbl>
## 1 53940 212135217 3932.8 858473135517 3989.44

group_by

summarise() is not terribly useful unless we pair it with group_by(). When you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.

TIP: group_by() is useful when calculating statistics per group. These statistics can then be easily compared.

TIP: Complicated models can also be built and then run on a group-by-group basis.

WARNING: When using group_by() with summarise() the groups get unwound after the summarise(). That means if you group by Var1 and Var2 after doing a summary the data frame will only be grouped by Var1. Thus the order of the variables used in the group_by() matter.

group_by

diamonds_grouped <- group_by(diamonds,cut)
summarise(
  diamonds_grouped,
  N = n(),
  average = mean(price),
  SD = sd(price)
)
## # A tibble: 5 x 4
##         cut     N  average       SD
##       <ord> <int>    <dbl>    <dbl>
## 1      Fair  1610 4358.758 3560.387
## 2      Good  4906 3928.864 3681.590
## 3 Very Good 12082 3981.760 3935.862
## 4   Premium 13791 4584.258 4349.205
## 5     Ideal 21551 3457.542 3808.401

%>% (pipe)

%>% is used to string functions together. This makes writing a set of logic clear and condensed.

diamonds%>%
  group_by(color, clarity)%>%
  summarise(n = n())%>%
  mutate(prop=n/sum(n))%>%
  plot_ly( x = ~color, y = ~prop, color= ~clarity,type = "bar",colors = pal_deloitte)%>%
  layout(barmode = "stack")

%>% (pipe)

Exercise

Using the transition data calculate the transistion matrix for each segment. A transition rate is defined as: \[p_{ij}=Pr({X_{t+1}=j|X_{t}=i})\] \[p_{ij}=\frac{\sum_n balance_{n,t} \times I(X_{n,t+1}=j|X_{n,t}=i)}{\sum_n balance_{n,t} \times I(X_{n,t}=i)}\] HINT: Make sure that your rows sum up to one